Content
- The grammar of graphics
- The major components of layers
- Hands-on practice
- Visualizations based on the gg approach
The grammar of graphics is a set of grammatical rules for creating perceivable graphs, or what we call graphics (Leland Wilkinson, 2005).
Take the analogy: good grammar is just the first step in creating a good sentence.
- DATA: a set of data operations that create variables from datasets,
- TRANS: variable transformations (e.g., rank),
- SCALE: scale transformations (e.g., log),
- COORD: a coordinate system (e.g., polar),
- ELEMENT: graphs (e.g., points) and their aesthetic attributes (e.g., color),
- GUIDE: one or more guides (axes, legends, etc.).

- Algebra, the operations that allow us to combine variables and specify dimensions of graphs.
- Scales involves the representation of variables on measured dimensions.
- Statistics covers the functions that allow graphs to change their appearance and representation schemes.
- Geometry covers the creation of geometric graphs from variables.
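The six components can be traced through a single small plot. The following is a minimal sketch in Python with matplotlib, not Wilkinson's own notation; the data are made up, and the mapping of each call to a component is an illustrative assumption.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# DATA: create variables from a (made-up) dataset
rng = np.random.default_rng(0)
x = rng.uniform(1, 1000, size=200)
y = np.sqrt(x) + rng.normal(0, 1, size=200)

# TRANS: a variable transformation, here the rank of y (1 = smallest)
y_rank = y.argsort().argsort() + 1

# COORD: matplotlib axes are Cartesian by default
fig, ax = plt.subplots()

# SCALE: a log scale on the x dimension
ax.set_xscale("log")

# ELEMENT: points, with color as an aesthetic attribute mapped to y
points = ax.scatter(x, y_rank, c=y, cmap="viridis", s=12)

# GUIDE: axis labels and a color legend
ax.set_xlabel("x (log scale)")
ax.set_ylabel("rank of y")
fig.colorbar(points, ax=ax, label="y")
```

Calling `fig.savefig(...)` with a file name of your choice writes the figure to disk.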

Graphical primitives
- geom_path()
- geom_rect()
- geom_polygon()

One variable
- geom_bar()
- geom_histogram()
- geom_density()

Two variables
- geom_smooth()
- geom_point()
- geom_count()
- geom_jitter()
- geom_boxplot()
- geom_violin()

Three variables
- geom_contour()
- geom_tile()
- geom_raster()

Daily temperature data

| station_id | month | day | temperature | flag | date | location |
|---|---|---|---|---|---|---|
| USC00042319 | 01 | 1 | 51.0 | S | 0-01-01 | Death Valley |
| USC00042319 | 01 | 2 | 51.2 | S | 0-01-02 | Death Valley |
| USC00042319 | 01 | 3 | 51.3 | S | 0-01-03 | Death Valley |
| USC00042319 | 01 | 4 | 51.4 | S | 0-01-04 | Death Valley |
| USC00042319 | 01 | 5 | 51.6 | S | 0-01-05 | Death Valley |
| USC00042319 | 01 | 6 | 51.7 | S | 0-01-06 | Death Valley |
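# Packages assumed from the calls below: ggplot2, lubridate (ymd),
# colorblindr (scale_color_OkabeIto), and dviz.supp (theme_dviz_grid)
library(ggplot2)
library(lubridate)
library(colorblindr)
library(dviz.supp)

# Line plot of daily temperature, one colored line per location
p <- ggplot(temps_long,
            aes(x = date,
                y = temperature,
                color = location)
  ) +
  geom_line(linewidth = 1) +
  scale_x_date(name = "month",
               limits = c(ymd("0000-01-01"), ymd("0001-01-04")),
               breaks = c(ymd("0000-01-01"), ymd("0000-04-01"), ymd("0000-07-01"),
                          ymd("0000-10-01"), ymd("0001-01-01")),
               labels = c("Jan", "Apr", "Jul", "Oct", "Jan"), expand = c(1/366, 0)) +
  scale_y_continuous(limits = c(19.9, 107),
                     breaks = seq(20, 100, by = 20),
                     name = "temperature (°F)") +
  scale_color_OkabeIto(order = c(1:3, 7), name = NULL) +
  theme_dviz_grid() +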
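  theme(legend.title.align = 0.5)

Seaborn
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Create plot
# (lf is the long-format temperature table; palette_map maps each location to a color)
fig, ax = plt.subplots(figsize=(9, 5))
# Use seaborn lineplot; pass palette by mapping
sns.lineplot(
    data=lf,
    x='date',
    y='temperature',
    hue='location',
    palette=palette_map,
    linewidth=1.5,  # similar to geom_line linewidth
    ax=ax
)
# X-axis limits and breaks (use valid years 2000-01-01 to 2001-01-04)
xmin = pd.to_datetime("2000-01-01")
# The next three lines are reconstructed from the limits named above and the output below
xmax = pd.to_datetime("2001-01-04")
ax.set_xlim(xmin, xmax)
ax.set_ylim(19.9, 107)

Output:
(np.float64(10957.0), np.float64(11326.0))
(19.9, 107.0)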
heatmap
Preprocessing:
- location & month
- month numbers with names

Mean temperature per month

| location | month | mean |
|---|---|---|
| Death Valley | Jan | 53.45161 |
| Death Valley | Feb | 59.94483 |
| Death Valley | Mar | 68.44839 |
| Death Valley | Apr | 76.29333 |
| Death Valley | May | 86.60645 |
| Death Valley | Jun | 95.54667 |
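The preprocessing behind a table like this can be sketched with pandas. This is an illustrative assumption about the steps (average per location and month, then swap month numbers for names), not the original code; the sample values and the second location are made up.

```python
import pandas as pd

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Tiny made-up sample in the shape of the daily temperature table
daily = pd.DataFrame({
    "location": ["Death Valley"] * 4 + ["Other site (hypothetical)"] * 4,
    "month": ["01", "01", "02", "02"] * 2,
    "temperature": [51.0, 51.2, 59.0, 61.0, 53.0, 55.0, 57.0, 59.0],
})

# Average the daily values within each location and month
monthly = (
    daily.groupby(["location", "month"], as_index=False)["temperature"]
    .mean()
    .rename(columns={"temperature": "mean"})
)

# Replace month numbers with abbreviated month names
monthly["month"] = monthly["month"].astype(int).map(lambda m: MONTHS[m - 1])

# Wide layout (one row per location, one column per month) for a heatmap
heat = monthly.pivot(index="location", columns="month", values="mean")
heat = heat.reindex(columns=[m for m in MONTHS if m in heat.columns])
```

The wide `heat` table is the shape most heatmap functions expect, for example `seaborn.heatmap(heat)`.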
statistical transformations

ggplot2 stat_ functions
Table adapted from Hadley Wickham (2016).

| Name | Description |
|---|---|
| bin | Divide continuous range into bins, and count number of points in each |
| boxplot | Compute statistics necessary for boxplot |
| contour | Calculate contour lines |
| density | Compute 1d density estimate |
| identity | Identity transformation, f(x) = x |
| jitter | Jitter values by adding small random value |
| qq | Calculate values for quantile-quantile plot |
| quantile | Quantile regression |
| smooth | Smoothed conditional mean of y given x |
| summary | Aggregate values of y for given x |
| unique | Remove duplicated observations |
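A stat is a data transformation that runs before anything is drawn. As a rough illustration, a few rows of the table can be mimicked with NumPy; this is a sketch of the idea on made-up numbers, not how ggplot2 implements its stat_ functions.

```python
import numpy as np

values = np.array([1.0, 1.5, 2.0, 2.5, 2.5, 3.0, 4.5, 5.0])

# "bin": divide the continuous range into bins and count the points in each
counts, edges = np.histogram(values, bins=4, range=(1.0, 5.0))

# "summary": aggregate the values of y for each distinct x (here, the mean)
x = np.array([1, 1, 2, 2, 2])
y = np.array([10.0, 20.0, 5.0, 5.0, 20.0])
groups = np.unique(x)
means = np.array([y[x == g].mean() for g in groups])

# "unique": remove duplicated observations
deduplicated = np.unique(values)

# "identity": f(x) = x, the data pass through unchanged
identity = values.copy()
```

The geom then draws the transformed result: bars for `counts`, one point per group for `means`, and so on.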
Relationship between body mass and head length in blue jays.

Blue jay dataset

| BirdID | KnownSex | BillDepth | BillWidth | BillLength | Head | Mass | Skull | Sex |
|---|---|---|---|---|---|---|---|---|
| 0000-00000 | M | 8.26 | 9.21 | 25.92 | 56.58 | 73.30 | 30.66 | 1 |
| 1142-05901 | M | 8.54 | 8.76 | 24.99 | 56.36 | 75.10 | 31.38 | 1 |
| 1142-05905 | M | 8.39 | 8.78 | 26.07 | 57.32 | 70.25 | 31.25 | 1 |
| 1142-05907 | F | 7.78 | 9.30 | 23.48 | 53.77 | 65.50 | 30.29 | 0 |
| 1142-05909 | M | 8.71 | 9.84 | 25.47 | 57.32 | 74.90 | 31.85 | 1 |
| 1142-05911 | F | 7.28 | 9.30 | 22.25 | 52.25 | 63.90 | 30.00 | 0 |
# Base plot: body mass vs. head length, with fixed axis limits
blue_jays_base <- ggplot(blue_jays, aes(Mass, Head)) +
  scale_x_continuous(limits = c(57, 82), expand = c(0, 0), name = "body mass (g)") +
  scale_y_continuous(limits = c(49, 61), expand = c(0, 0), name = "head length (mm)") +
  theme_dviz_grid()

# Overlay 2D density contours and semi-transparent points
blue_jays_base +
  stat_density_2d(color = "black", linewidth = 0.4, binwidth = 0.004) +
  geom_point(color = "black", size = 1.5, alpha = 1/3)

sex

Common applications:
- Heatmaps: aggregate values into grid cells to display intensity across two dimensions

Prompt: Given a pandas DataFrame with more than 200 million rows and an 'mz' column with more than 26 million unique values, how can the table be aggregated so that we can create a heat map with 'mz' on the vertical axis, time on the horizontal axis, and intensity on the 'z' axis (color)?
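One possible answer, sketched under stated assumptions: replace the millions of distinct 'mz' values with a fixed number of bins, do the same for time, and sum intensity per grid cell, one chunk at a time so the full table never has to be pivoted. The column names 'time' and 'intensity', the bin counts, the chunk size, and the helper `bin_to_grid` are hypothetical; the data below are random stand-ins.

```python
import numpy as np
import pandas as pd

def bin_to_grid(chunks, mz_edges, time_edges):
    """Sum intensity into a fixed (mz bin, time bin) grid, one chunk at a time."""
    grid = np.zeros((len(mz_edges) - 1, len(time_edges) - 1))
    for chunk in chunks:
        hist, _, _ = np.histogram2d(
            chunk["mz"].to_numpy(),
            chunk["time"].to_numpy(),
            bins=[mz_edges, time_edges],
            weights=chunk["intensity"].to_numpy(),
        )
        grid += hist
    return grid

# Random stand-in for the real table (assumed columns: mz, time, intensity)
rng = np.random.default_rng(1)
n = 10_000
df = pd.DataFrame({
    "mz": rng.uniform(100.0, 1000.0, n),
    "time": rng.uniform(0.0, 60.0, n),
    "intensity": rng.exponential(1.0, n),
})

mz_edges = np.linspace(100.0, 1000.0, 201)  # 200 mz bins on the vertical axis
time_edges = np.linspace(0.0, 60.0, 121)    # 120 time bins on the horizontal axis

# Split into chunks to mimic a table that is too large to process in one pass
chunks = (df.iloc[i:i + 2_500] for i in range(0, n, 2_500))
grid = bin_to_grid(chunks, mz_edges, time_edges)
```

The result is a 200 x 120 array regardless of how many rows go in, so it can be drawn directly, for example with `matplotlib.pyplot.pcolormesh(time_edges, mz_edges, grid)`. Summing is one choice of aggregate; the maximum or mean per cell may suit other questions better.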